


Quarterly Reviews of Biophysics 36, 3 (2003), pp. 307-340. © 2003 Cambridge University Press
DOI: 10.1017/S0033583503003901 Printed in the United Kingdom



Prediction of protein function from protein 
sequence and structure 

James C. Whisstock 1 and Arthur M. Lesk 1,2 *

Microbial Genetics, Monash University, Clayton, Victoria, Australia



Abstract. The sequence of a genome contains the plans of the possible life of an organism 
but implementation of genetic information depends on the functions of the proteins and 
nucleic acids that it encodes. Many individual proteins of known sequence and structure
present challenges to the understanding of their function. In particular, a number of genes 
responsible for diseases have been identified but their specific functions are unknown. Whole-
genome sequencing projects are a major source of proteins of unknown function. Annotation
of a genome involves assignment of functions to gene products, in most cases on the basis of
amino-acid sequence alone. 3D structure can aid the assignment of function, motivating the
challenge of structural genomics projects, to make structural information available for novel 
uncharacterized proteins. Structure-based identification of homologues often succeeds where
sequence-alone-based methods fail, because in many cases evolution retains the folding 
pattern long after sequence similarity becomes undetectable. Nevertheless, prediction of 
protein function from sequence and structure is a difficult problem, because homologous
proteins often have different functions. Many methods of function prediction rely on identifying
similarity in sequence and/or structure between a protein of unknown function and one or
more well-understood proteins. Alternative methods include inferring conservation patterns in
members of a functionally uncharacterized family for which many sequences and structures are
known. However, these inferences are tenuous. Such methods provide reasonable guesses at 
function, but are far from foolproof. It is therefore fortunate that the development of whole- 
organism approaches and comparative genomics permits other approaches to function 
prediction when the data are available. These include the use of protein-protein interaction 
patterns, and correlations between occurrences of related proteins in different organisms as 
indicators of functional properties. Even if it is possible to ascribe a particular function to a 
gene product, the protein may have multiple functions. A fundamental problem is that function 
is in many cases an ill-defined concept. In this article we review the state of the art in function
prediction and describe some of the underlying difficulties and successes. 
1. Introduction

Much of the evolution of living systems on the molecular level proceeds according to the 
cascade: 

gene sequence determines amino-acid sequence, 
amino-acid sequence determines protein structure, 
protein structure determines protein function, 

selection acts on function to modify allele frequencies in populations (to close the loop). 

Genome sequencing projects produce the full DNA sequences of organisms. Identification of 
genes within genomes provides the amino-acid sequences of the organism's proteins. In struc- 
tural genomics projects, X-ray crystallography and NMR spectroscopy aim to determine the 
structures of a subset of the proteins from which other structures can be predicted by homology 
modelling. Contemporary bioinformatics collects data on sequences, structures, and functions, 
and studies the correspondences between them (for general references, see Galperin & Koonin,
2002; Lesk, 2001, 2002). 

Assignments of function are based either solely on amino-acid sequences (the most common
situation, which arises frequently in annotating newly sequenced genomes), on some combination
of sequence and structure, or on organism-wide data, if available, such as protein-protein
interaction patterns (von Mering et al. 2003). The problem of predicting protein function arises in
two general contexts. (1) The interest of a research group may be focussed on a gene and its
protein product, and the group may pursue its investigation in detail; such studies may include
identification of cofactors and post-translational modifications, and even a structure determi-
nation and a check of the phenotypic effect of a knockout. The result is an attempt to assign
function on the basis of a thick dossier of detailed information. In the past this was the paradigm.
(2) However, with increasing frequency we must deal with much sparser information. The largest
sources of proteins of unknown function are complete genome sequences, giving us the chal-
lenge of annotating them (Smith, 1998; Eisenberg et al. 2000; Stein, 2001; Thornton, 2001). In
these cases the data about specific proteins in genomes are often limited to their amino-acid
sequences. The goal of providing at least approximate structural information, for its implications
about function, is an important motivation of structural genomics projects (Burley et al. 1999;
Eisenstein et al. 2000; Skolnick et al. 2000; Brenner, 2001; Chance et al. 2002; Gilliland et al. 2002;
Zhang & Kim, 2003).

In analysing a novel genome, how well do we understand Nature's rules in proceeding from 
DNA sequence to amino-acid sequence to protein structure to function? 

• Starting from a genome sequence, gene identification is still problematic, especially in
eukaryotes where alternative splicing patterns compound the difficulty (Novichkov et al. 2001;
Jones et al. 2002).

• The next step is perhaps the safest: Based originally on the experiments of Anfinsen
demonstrating the reversible denaturation of proteins, we know that Nature has strict rules
for determining protein structure uniquely from amino-acid sequences. There are a few
exceptions - notably the prion proteins (Cohen & Prusiner, 1998; Peretz et al. 2002), and
the serpins (Whisstock et al. 1998; Gettins, 2002; Pike et al. 2002) - but this generalization is
among the most robust we have in the field. (Chaperones are only 'catalytic' in this process,
not containing any information specific to the folding of any particular protein.) Although as
yet we do not understand the physical basis of Nature's folding algorithm in sufficient detail to
predict structure from sequence, progress is being made (Schonbrun et al. 2002; Tramontano,
2003). Moreover, the observation that similar sequences determine similar structures (the
'differential form' of the folding problem) gives us general confidence in homology modelling.
Much less reliable is the widely held assumption that proteins with very similar sequences
should - by virtue of their very similar structures - have similar functions.

• To reason from sequence and structure to function is to step onto much shakier ground. 
Following the reasoning of the previous paragraph, a common way to try to assign function to 
a protein is to identify a putative homologue of known function and guess that both share a
common function. It is indeed true that many families of proteins contain homologues with
the same function, widely distributed among species; for these, reasoning from homology
does assign function correctly. However, the assumption that homologues share function is
less and less safe as the sequences progressively diverge. Moreover, even closely related pro-
teins can change function, either through divergence to a related function or by recruitment
for a very different function (Ganfornina & Sanchez, 1999). In such cases, assignment of 
function on the basis of homology, in the absence of direct experimental evidence, will give 
the wrong answer, leading to misannotations in databanks. Many authors have called attention 
to 'howlers' in annotation (Smith & Zhang, 1997; Bork et al. 1998; Bork & Koonin,
1998; Doerks et al. 1998; Karp, 1998; Brenner, 1999; CODATA Task Group, 2000;
Devos & Valencia, 2000, 2001; Gerlt & Babbitt, 2000; Jeong & Chen, 2001). Iyer et al. (2001)
have collected cases in which prediction and experiment agree, but both are likely to be wrong!
Indeed, the situation can be even worse. An often-asked question is: 'How much must a
protein change its sequence before its function changes?' The answer is: 'Not at all!' There
are numerous examples of proteins with multiple functions: 

(1) Eye lens proteins in the duck are identical in sequence to active lactate dehydrogenase and
enolase in other tissues, although they do not encounter the substrates in the eye. They have
been recruited to provide a completely unrelated function based on the optical properties of 
their assembly. Several other avian eye lens proteins are identical or similar to enzymes. In 
some cases residues essential for catalysis have mutated, proving that the function of these 
proteins in the eye is not an enzymic one (Wistow & Piatigorsky, 1987). Note that the 
coexistence in some species of mutated inactive enzymes in the eye, and active enzymes in 
other tissues, implies that the gene must have been duplicated. 

(2) Certain proteins interact with different partners to produce oligomers with different func-
tions. In Escherichia coli, a protein that functions on its own as lipoate dehydrogenase is also an
essential subunit of pyruvate dehydrogenase, 2-oxoglutarate dehydrogenase and the glycine 
cleavage complex (Riley, 1997). 

(3) Proteinase Do functions as a chaperone at low temperatures and as a proteinase at high
temperatures. The logic, apparently, is that under conditions of moderate stress it attempts to
salvage misfolded proteins; under conditions of higher stress it 'gives up' and recycles them
(Spiess et al. 1999).

(4) Phosphoglucose isomerase (= neuroleukin = autocrine motility factor = differentiation and
maturation mediator) functions as a glycolytic enzyme in the cytoplasm, but as a nerve
growth factor and cytokine outside the cell (Jeffery, 1999; Jeffery et al. 2000). The structural
origin of the extracellular receptor function is obscure.

These cases imply that even if detailed studies of the classical biochemical type on isolated proteins
in dilute salt solutions do identify a function, we cannot be sure that we know the molecule's full 
repertoire of biological activities. 

Conversely, non-homologous proteins may have similar functions. Chymotrypsin and sub- 
tilisin, two proteinases that even share a common Ser-His-Asp catalytic triad, are not homolo- 
gous, and show entirely different folding patterns (Fig. 1). They are a standard example of 
convergent evolution. The Ser-His-Asp triad also appears in other proteins, including lipases and 
a natural catalytic antibody. This and other examples show that it is not possible to reason that
if two proteins have different folding patterns they must have different functions. 

In summary (see Fig. 2) : 

• Similar sequences produce similar protein structures, with divergence in structure increasing pro-
gressively with the divergence in sequence (Chothia & Lesk, 1986).

• Conversely, similar structures are often found with very different sequences. For instance, many proteins
form TIM barrels with no easily detectable relationship between their sequences (Copley &
Bork, 2000; Nagano et al. 2002).

• Similar sequences and structures sometimes produce proteins with similar functions, but exceptions abound
(Ponting, 2001; an extensive table appears in Rost, 2002).

• Conversely, similar functions are often carried out by proteins with dissimilar structures; examples
include the many different families of proteinases, sugar kinases, and lysyl-tRNA synthetases
(Doolittle, 1994; Galperin et al. 1998).

[Figure 1 panels: Chymotrypsin [5cha]; Subtilisin BPN' [5sic]]

Fig. 1. Chymotrypsin and subtilisin are both proteinases. Although they have entirely different folding
patterns, they share a common mechanism, including the catalytic triad Ser-His-Asp. The similarity of
function and mechanism has arisen by convergent evolution.




Fig. 2. Organization of the spaces of protein sequences, structures and functions. Thin solid outlines:
similar sequences produce similar structures, but not all similar structures have recognizably similar se-
quences. Broken outlines: proteins of different sequence and structure can share similar functions. Thick
solid outlines: conversely, proteins of similar sequence and structure can show different functions.

Because evolution has so assiduously pushed the limits in its exploration of sequence-structure-
function relationships, many procedures described in the literature on function prediction do not
specify function exactly, but do provide general hints. For instance, a protein known to be a TIM
barrel is likely to be a hydrolytic enzyme. Such hints are very useful in guiding experimental
investigations of function, and indeed a sufficient accumulation of hints - based on sequence,
structure, genomics, and interaction patterns - may well allow an expert to make a reasonable
proposal of a specific function. However, such an approach, relying as it does on human ex-
pertise, is difficult to automate for high-throughput full-genome analysis.

Two examples from the Haemophilus influenzae structural genomics project illustrate the point. 
High-resolution crystal structures of the proteins HI1434 (Zhang et al. 2000) and HI1679
(Parsons et al. 2002) have been determined. Hypothesis of function from evolutionary relation-
ships and detailed examination of the structures, followed by experimental verification of correct
functional assignment, was successful in the case of HI1679 but has not so far proved
successful for HI1434.

HI1679 has an α/β-hydrolase fold, with putative remote homology, based on sequence
analysis, to members of the L-2-haloacid dehalogenase family, the P-domain of Ca2+ ATPase
and phosphoserine phosphatase. It was the first structure of a protein in the L-2-haloacid de-
halogenase family to be determined, and one of the motives for selecting it for investigation
was the goal of learning about the structure and the mechanism of function of this family. The
structure was consistent with a phosphatase, and this was confirmed by trying a variety of
potential substrates. The protein cleaved 6-phosphogluconate and phosphotyrosine, confirming
it to be a phosphatase. Addressing the original goal of elucidating the functions of this family of
proteins, observed substrates were modelled into the binding pocket to supply suggestions about
how sequence variation in the active site might affect specificity (Parsons et al. 2002).

HI1434 is related to a region in tRNA synthetases. The structure showed a putative binding 
site, a cleft that was conserved in the modelled structures of homologues. The structure itself and
its evolutionary relationships suggest that it binds a nucleotide in its cleft. However, in this case 
no specific ligand has so far been identified. 

View it in these terms: Inferring protein function from knowledge of the function of a close
homologue is like solving the clue of an American crossword puzzle. Finding the word that
satisfies the definition may be difficult but the task is in principle straightforward. Working out
the function of a protein from its sequence and structure is like solving the clue of a British
crossword puzzle. It is by no means obvious which features of the definition are providing the
real clues, as opposed to misleading ones. Also, for both types of puzzle and for the suggestion
of a protein function, even if your answer appears to fit it may be wrong.

2. Plan of this article 

Our goal is to review methods that have been proposed for prediction of protein function from 
amino-acid sequence and three-dimensional (3D) structure, and, as far as possible, to evaluate 
them. However, it is difficult to state criteria for successful prediction of function, since function 
is in principle a fuzzy concept. Given three sequences, it is possible to decide which of the three 
possible pairs is the most closely related. Given three structures, methods are also available to 
measure and compare the similarity of the pairs. However, in many cases, given three protein 
functions, it would be more difficult to choose the pair with the most similar function. For 
example, although it is possible to define metrics for quantitative comparisons of different 
protein sequences and structures, this is more difficult for different protein functions. 
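
The contrast drawn above can be made concrete for sequences. The following is a minimal sketch, not a method taken from this review: it assumes three sequences that have already been aligned to equal length (gaps written as '-') and uses percent identity as the similarity measure. The toy sequences and the names `percent_identity` and `closest_pair` are hypothetical.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def closest_pair(aligned):
    """Return (name1, name2, identity) for the most similar pair of sequences."""
    scored = [
        (percent_identity(aligned[p], aligned[q]), p, q)
        for p, q in combinations(sorted(aligned), 2)
    ]
    identity, p, q = max(scored)
    return p, q, identity


# Hypothetical pre-aligned toy sequences, for illustration only.
toy = {
    "A": "MKTAYIAKQR",
    "B": "MKTAYIGKQR",
    "C": "MRTGYLAKHR",
}
print(closest_pair(toy))
```

Nothing comparably simple can be written for three protein functions, because there is no agreed alphabet or alignment of functions over which to count matches. That gap is the difficulty the paragraph above describes.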

Comparisons of functions could be based on suggested classifications of functions. There are
many such classifications (recently reviewed by Ouzounis et al. 2003). Probably the most widely
known is the Enzyme Commission (EC) scheme, limited of course to that class of functions.
Other protein function classification schemes have been proposed, many in connection with 
individual organisms or individual families of proteins. However, a scheme appropriate for one 
organism is not necessarily appropriate for others, and until recently there has been no noticeable 
attempt at consistency. 

Indeed, even for very well understood proteins, there are different legitimate points of view
about what aspects of function to focus on. The biochemist looks for the process mediated by 
the isolated protein in dilute solution. The molecular biologist looks for the significance, in the 
overall scheme of the life of the cell, of the process or processes in which the protein participates. 
We describe and compare various schemes for classifying protein function and ask whether it is 
possible to reconcile the different points of view. We also suggest that the Gene Ontology 
Consortium offers the most attractive approach. 

If we had a classification of protein functions, we would want to map it onto classifications of 
sequence and structure. Classifications of sequence and structure are available, based on evol- 
utionary principles. Therefore, to work with an appropriate classification of function, it is useful 
to understand how evolving proteins develop different functions. After developing this as 
background, we describe and classify the various methods that have been used to predict protein
function and annotate genomes. 



3. Natural mechanisms of development of novel protein functions

Information available about how proteins alter existing functions or develop new ones is 
abundant, although most of it is more anecdotal than systematic. Observed mechanisms of 
protein evolution that produce altered or novel functions include: (1) Divergence, (2) Recruit-
ment and (3) 'Mixing and matching' of domains.

3.1 Divergence

In families of closely related proteins, mutations usually conserve function but modulate speci-
ficity. For example, the trypsin family of serine proteinases contains a specificity pocket: a
surface cleft complementary in shape and charge distribution to the side-chain adjacent to the 
scissile bond. Mutations tend to leave the backbone conformation of the pocket unchanged but 
to affect the shape and charge of its lining, altering the specificity. 

The change in specificity of the proteases illustrates a common theme: Although homologous 
proteins show a general drifting apart of their sequences as they accumulate mutations, often a 
few specific mutations account for functional divergence (Golding & Dean, 1998), as initially
proposed by Perutz (1983) for haemoglobin. The malate and lactate dehydrogenase (MDH/
LDH) family is a good example. Malate and lactate dehydrogenases are related enzymes cata-
lysing related reactions. Wilks et al. (1988) showed by site-directed mutagenesis that a single
residue change could switch the activity. Their paper may have been read by a trichomonad,
which developed an MDH that, in a family tree of these enzymes, is much more similar to LDH
molecules than to other MDHs and appears to have arisen by convergent evolution (Wu et al.
1999).

The TIM-barrel structure, or very similar variants, has now appeared in over 100 enzymes of
known crystal structure (Fig. 3). In many cases the sequence similarity is so low that it is im- 
possible to say whether the proteins are genuinely related, or whether evolution has discovered 
this very stable and useful fold more than once. Conversely, certain enzymes sharing the TIM- 
barrel fold, and which are similar enough for us to be confident of their homology, clearly show 
the divergent evolution of new functions (Copley & Bork, 2000; Anantharaman et al. 2003).

Fig. 3. Spinach glycolate oxidase, one of the many enzymes with the TIM-barrel structure. In this view into
the barrel, the molecule is orientated with the C-termini of the strands nearer the viewpoint. This is the side 
of the barrel that in most cases carries the active site. 

The enolase superfamily, which exhibits a folding pattern very closely related to the TIM barrel,
contains several enzymes that catalyse different reactions with shared features of their mechan-
isms (Hasson et al. 1998). These include enolase itself, mandelate racemase, muconate lactonizing
enzyme I, and D-glucarate dehydratase. From the point of view of sequence similarity, these
enzymes are fairly close relatives. Mandelate racemase and muconate lactonizing enzyme
I have 25% sequence identity. However, looking only at sequence and structure runs the
risk of overlooking a more subtle similarity. What these enzymes share is a common feature
of their mechanism. Each acts by abstracting a proton adjacent to a carboxylic acid to form
an enolate intermediate (Fig. 4). The stabilization of a negatively charged transition state is
conserved. In contrast, the subsequent reaction pathway, and the nature of the product,
vary from enzyme to enzyme. These enzymes have not only a similar overall structure, a vari-
ant of the TIM-barrel fold, but each also requires a divalent metal ion, bound by structurally
equivalent ligands. Different residues in the active site produce enzymes that catalyse different
reactions.

An aspect of divergence important for its implications about function is the distinction be-
tween orthologues and paralogues. Any two proteins that are related by descent from a common
ancestor are homologues. Two proteins in different species descended from the same protein in
an ancestral species are orthologues. Two proteins related via a gene duplication within one
species (and the respective descendants of the duplicates) are paralogues. After gene duplication,
one of the resulting pairs of proteins can continue to provide its customary function, releasing
the other to diverge to develop new functions. Therefore inferences of function from homology
are more secure for orthologues than for paralogues.
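
One common way to apply these definitions computationally is to label each internal node of a gene tree as a speciation or a duplication, and then classify a pair of genes by the event at their most recent common ancestor. The sketch below assumes such a labelled tree is already available. The `Node` class, the function names and the four-gene toy tree are hypothetical and serve only to illustrate the definitions above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    event: str = "leaf"  # "speciation", "duplication", or "leaf"
    children: List["Node"] = field(default_factory=list)


def path_to(root: Node, leaf_name: str) -> List[Node]:
    """Return the list of nodes from root down to the named node (empty if absent)."""
    if root.name == leaf_name:
        return [root]
    for child in root.children:
        sub = path_to(child, leaf_name)
        if sub:
            return [root] + sub
    return []


def relationship(root: Node, a: str, b: str) -> str:
    """Classify two distinct genes by the event at their last common ancestor."""
    if a == b:
        raise ValueError("need two distinct genes")
    pa, pb = path_to(root, a), path_to(root, b)
    if not pa or not pb:
        raise KeyError("gene not found in tree")
    lca = None
    for x, y in zip(pa, pb):
        if x is y:
            lca = x  # still on the shared part of both paths
        else:
            break
    return "orthologues" if lca.event == "speciation" else "paralogues"


# Toy tree: one ancestral duplication, then a speciation in each copy.
tree = Node("ancestor", "duplication", [
    Node("copy1", "speciation", [Node("speciesX_gene1"), Node("speciesY_gene1")]),
    Node("copy2", "speciation", [Node("speciesX_gene2"), Node("speciesY_gene2")]),
])
print(relationship(tree, "speciesX_gene1", "speciesY_gene1"))
print(relationship(tree, "speciesX_gene1", "speciesY_gene2"))
```

In this toy tree the two copies in different species that trace back to the duplication are classified as paralogues, matching the clause 'and the respective descendants of the duplicates' in the definition.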

The database, Clusters of Orthologous Groups (COGs), is a collection of proteins encoded in 
fully sequenced genomes, organized into families (Natale et al. 2000). The COGs database has
been applied to analysis of function and genome annotation. 

Comparative analyses of known structures in such families of enzymes illustrate the kinds 
of structural features that change and those that stay the same. In some cases, the catalytic 
atoms occupy the same positions in molecular space, although the residues that present 
them are located at different positions in the sequence. In other cases the positions in space of
the catalytic residues are conserved even though the identities and functions of the catalytic
residues vary. In these cases, there appears to be a set of conserved 'functional positions' relative
to the molecular framework. When functional residues are conserved in this way, in the structure
if not necessarily in the sequence, they can provide a signal from which we can recognize
function.

Fig. 4. Common mechanism in the enolase family of enzymes: (a) mandelate racemase, (b) muconate
lactonizing enzyme, (c) enolase. (After Hasson et al. 1998.)

However, several enzyme families show an even greater degree of divergence, including 
variation in the residues responsible for mediating catalysis. For example, the Apurinic/Apyri-
midinic endonuclease superfamily is a large, diverse family of phosphoesterases. The family
includes members that cleave nucleic acids (both DNA and RNA). However, the family has
diverged to include lipid phosphatases. The essential catalytic residues vary between different
subfamilies; for example, an essential His in the DNA repair enzyme DNase I is not conserved in
exonuclease III. In these cases, the conservation patterns from which we could hope to identify
function have disappeared. 

In some cases very large divergence has led to very different function. Murzin (1998) and 
Grishin (2001) have discussed how far divergence can push the relationships between homology,
structure and sequence divergence, and functional change. Some changes in folding pattern, or
topology, associated with functional changes, are:

(1) Addition/deletion/substitution of secondary structural elements. A dramatic example is the
relationship between luciferase and a non-fluorescent flavoprotein, which, although they
have 30% sequence identity, show a standard TIM barrel in the case of luciferase but a
truncated barrel in the non-fluorescent flavoprotein.

(2) Circular permutation. An example is NK-lysin, an all-α protein, and the aspartic proteinase
prophytepsin.

(3) Strand invasion and withdrawal. Although insertion of strands at the end of a β-sheet is
relatively simple, it is more difficult to insert a strand into a β-barrel. Lipocalins include the
homologues retinol-binding protein with an 8-stranded β-barrel, and retinoic acid-binding
protein with a 10-stranded β-barrel.

(4) Changing the topology while maintaining the architecture. Aeromonas aminopeptidase and
carboxypeptidase G2 have a common core of secondary structural elements, but are 'wired
up' by connecting loops in a different way. The thrombin inhibitor triabin is likely to be
related to the lipocalins, on the basis of similarities in the amino-acid sequences. Both contain
β-barrel folds, but superposing the structures shows that two of the strands have been
swapped.
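
Item (2) in the list above, circular permutation, has a simple idealized form: one sequence is a rotation of the other. The sketch below tests for an exact rotation using the standard trick of searching for one string inside the other string doubled. It is a toy illustration under the assumption of identical residues. Real circularly permuted proteins have also diverged in sequence, so detecting them in practice needs alignment-based methods. The function name `circular_shift` is hypothetical.

```python
from typing import Optional


def circular_shift(a: str, b: str) -> Optional[int]:
    """Return k such that b == a[k:] + a[:k], or None if b is not a rotation of a."""
    if len(a) != len(b) or not a:
        return None
    # Every rotation of a occurs as a substring of a + a.
    index = (a + a).find(b)
    return index if index != -1 else None


# Toy strings standing in for sequences; letters are arbitrary labels.
print(circular_shift("ABCDEFGH", "DEFGHABC"))
print(circular_shift("ABCDEFGH", "ABCDEFHG"))
```

The returned offset marks where the new N-terminus of the permuted chain falls in the original order.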

3.2 Recruitment 

The application of enzymes as lens crystallins illustrated another route of evolution: a novel
function preceding divergence. It is more difficult to distinguish divergence and recruitment than it
might first appear. Divergence and recruitment are at the ends of a broad spectrum of changes in
sequence and function. Apart from cases of 'pure' recruitment such as the duck eye lens proteins
or phosphoglucose isomerase, in which a protein adopts a new function with no sequence
change at all, there are examples not only of relatively small sequence changes correlated with
very small function changes (which most people would think of as relatively pure divergence),
and relatively small sequence changes with quite large changes in function (which most people
would think of as recruitment), but also many cases in which there are large changes in both
sequence and function.

3.3 'Mixing and matching' of domains, including duplication/oligomerization, and domain
swapping or fusion

Many large proteins contain tandem assemblies of domains which appear in different contexts 
and orders in different proteins. (The reader must be aware that there is no universal agreement 
about how to define a domain or a module; one traditional definition is that a domain is a 
compact subunit of a protein that looks as if it should have independent stability. Some authors 
refer to a compact unit as a module, and reserve the term domain for a unit that stays together as 
an evolutionary unit, appearing in partnership with different sets of other domains, or in different 
orders along the chain. These authors describe the serine protease structure as a single domain 
comprising two modules.) The giant muscle protein titin contains a long concatenation of up to 
about 300 modules each of which is homologous to either an immunoglobulin superfamily
domain or a fibronectin III domain (Kenny et al. 1999). Titin is an extreme example; most
modular proteins contain only a few. 

Censuses of genomes suggest that many proteins are multimodular. Serres et al. (2001) report
that of 4401 genes in E. coli, 287 correspond to proteins containing 2, 3 or 4 modules. Teichmann
et al. (2001b, c) have analysed, for enzymes involved in metabolism of small molecules, the
distribution and redistribution of domains. The structural patterns of 510 enzymes could be 
accounted for in total or in part by 213 families of domains. Of the 399 which could be entirely 
divided into known domains, 68% were single-domain proteins, 24% comprised two domains, 



Table 1. General classification of protein functions (Andrade et al. 1999)

Energy
  • Biosynthesis of cofactors, amino acids
  • Central and intermediary metabolism
  • Energy metabolism
  • Fatty acids and phospholipids
  • Nucleotide biosynthesis
  • Transport

Information
  • Replication
  • Transcription
  • Translation

Communication and regulation
  • Regulatory functions
  • Cell envelope/cell wall
  • Cellular processes



and 7% three domains. Only 4 of the 399 had 4, 5 or 6 domains. Teichmann et al. (2001b, c) also
showed that there are marked preferences for pairing of different families of domains.
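
As a rough consistency check on the domain census quoted above, the percentages can be converted back into approximate protein counts. The counts below are back-calculated here from the rounded percentages. They are not figures reported in the text and should be read as approximations only.

```python
# Figures quoted in the text: 399 enzymes fully divisible into known domains.
total = 399
reported_percent = {1: 68, 2: 24, 3: 7}  # domains per protein -> percent of the 399
four_to_six_domains = 4                  # stated directly as a count

# Approximate counts implied by the rounded percentages (an assumption of this sketch).
approx_counts = {n: round(total * p / 100) for n, p in reported_percent.items()}
accounted_for = sum(approx_counts.values()) + four_to_six_domains
multi_domain = accounted_for - approx_counts[1]

print(approx_counts, accounted_for, multi_domain)
```

Under this rounding the four categories account for all 399 proteins, and roughly a third of them contain more than one domain.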

Thus multi-domain proteins present particular problems for functional annotation, because 
domains may possess independent functions, modulate one another's function, or act in concert 
to provide a single function. However, in some cases the presence of a particular domain or 
combinations of domains is associated with a specific function. For example, NAD-binding 
domains appear almost exclusively in dehydrogenases. 

4. Classification schemes for protein functions

4.1 General schemes

Several schemes for classification of protein functions have been proposed. We begin with some 
fairly general categories. 

Andrade et al. (1999) distinguished the functional classes of proteins involved in energy,
information, and communication and regulation. Within these general classes they offered the 
subdivisions shown in Table 1. These categories comprise fairly general activities rather than 
individual protein functions. For example, biosynthesis of an amino acid often involves a 
sequence of reactions catalysed by unrelated enzymes. Despite the differences in the precise 
function of these enzymes and in their structure and mechanism, all would fall into a single class 
in this scheme. 

Other classifications have appeared in connection with genome sequencing projects. It is
interesting to compare an analysis of functional categories suggested for a prokaryote (E. coli)
(Table 2) with those suggested for a eukaryote (Saccharomyces cerevisiae) (Table 3).

There is a good deal more overlap in these two schemes than first appears. The E. coli classes 
contain a much more precise subdivision of metabolic reactions than the yeast scheme. Perhaps 
this is an example of the differences in point of view among biochemistry, molecular biology and 
cell biology. Nevertheless, for purposes of annotating a genome, most people would hope for 
more specific assignments of function than any of these categories. Note also that the different 
functions of phosphoglucose isomerase, which is also a neuroleukin, an autocrine motility factor,

Table 2. Functional groups of proteins for E. coli (Blattner et al. 1997)

Regulatory function
Putative regulatory proteins
Cell structure
Putative membrane proteins
Putative structural proteins
Phage, transposons, plasmids
Transport and binding proteins
Putative transport proteins
Energy metabolism
DNA replication, recombination, modification, and repair
Transcription, RNA synthesis, metabolism, and modification
Translation, post-translational protein modification
Cell processes (including adaptation, protection)
Biosynthesis of cofactors, prosthetic groups, and carriers
Putative chaperones
Nucleotide biosynthesis and metabolism
Amino acid biosynthesis and metabolism
Fatty acid and phospholipid metabolism
Carbon compound catabolism
Central intermediary metabolism
Putative enzymes
Other known genes (gene product or phenotype known)
Hypothetical, unclassified, unknown


Table 3. Functional categories suggested for yeast
(see http://mips.gsf.de/proj/yeast/catalogues/funcat/)

Metabolism
Cell cycle and DNA processing
Transcription
Protein synthesis
Protein fate (folding, modification, destination)
Cellular transport and transport mechanisms
Cellular communication/signal transduction mechanism
Cell rescue, defense and virulence
Regulation of/interaction with cellular environment
Cell fate
Transposable elements, viral and plasmid proteins
Control of cellular organization
Subcellular localization
Protein activity regulation
Protein with binding function or cofactor requirement (structural or transport facilitation)
Classification not yet clear cut
Unclassified proteins



and a differentiation and maturation mediator (Jeffery et al. 2000) straddle different classes, so that
it will be impossible in general to assign individual proteins to unique functional classes.



4.2 The EC classification 



The best-known detailed classification of protein functions is that of the EC. Naturally, the
EC classification applies only to enzymes. Given our ultimate goal of mapping sequence and
structure onto function, it is important to bear in mind the Commission's emphasis that: 'It is
perhaps worth noting, as it has been a matter of longstanding confusion, that enzyme nomenclature
is primarily a matter of naming reactions catalysed, not the structures of the proteins that
catalyse them.'

The origin of the EC classification was the action taken by the General Assembly of the 
International Union of Biochemistry (IUB), in consultation with the International Union of 
Pure and Applied Chemistry (IUPAC), in 1955, to establish an International Commission 
on Enzymes. The EC published its classification scheme, first on paper and now on the web 
(see http://www.chem.qmul.ac.uk/iubmb/enzyme/).

EC numbers (looking suspiciously like IP numbers) contain four fields, corresponding to a
four-level hierarchy. For example, EC 1.1.1.1 corresponds to alcohol dehydrogenase, catalysing
the general reaction:

an alcohol + NAD = the corresponding aldehyde or ketone + NADH2.

Note that several reactions, involving different alcohols, would share this number; but that the 
same dehydrogenation of one of these alcohols by an enzyme using the alternative cofactor 
NADP would be assigned EC 1.1.1.2. 
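
The four-field structure lends itself to a simple comparison of two enzymes by the number of leading fields they share. The following is a minimal sketch in Python; the function names `parse_ec` and `shared_levels` are illustrative assumptions, not part of any EC resource.

```python
from typing import Tuple


def parse_ec(ec: str) -> Tuple[int, ...]:
    """Split an EC number such as 'EC 1.1.1.1' into its four integer fields."""
    fields = ec.replace("EC", "").strip().split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four fields, got {ec!r}")
    return tuple(int(f) for f in fields)


def shared_levels(ec_a: str, ec_b: str) -> int:
    """Count how many leading fields two EC numbers have in common."""
    count = 0
    for a, b in zip(parse_ec(ec_a), parse_ec(ec_b)):
        if a != b:
            break
        count += 1
    return count


# The NAD- and NADP-dependent alcohol dehydrogenases from the text
print(shared_levels("EC 1.1.1.1", "EC 1.1.1.2"))  # 3
```

The two dehydrogenases agree in the first three fields and differ only in the fourth, which is the sense in which later sections speak of conserving 'three levels' of the EC hierarchy.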

The first number shows to which of the six main divisions (classes) the enzyme belongs: 

Class 1. Oxidoreductases
Class 2. Transferases
Class 3. Hydrolases
Class 4. Lyases
Class 5. Isomerases
Class 6. Ligases.

The significance of the second and third numbers depends on the class. For oxidoreductases the
second number describes the substrate and the third number the acceptor. For transferases, the
second number describes the class of item transferred, and the third number describes either
more specifically what they transfer or in some cases the acceptor. For hydrolases, the second
number signifies the kind of bond cleaved (e.g. an ester bond) and the third number the
molecular context (e.g. a carboxylic ester or a thiol ester). (Proteinases are treated slightly
differently, with the third number including the mechanism: serine proteinases, thiol proteinases
and acid proteinases are classified separately.) For lyases the second number signifies the kind
of bond formed (e.g. C-C or C-O), and the third number the specific molecular context. For
isomerases, the second number indicates the type of reaction and the third number the specific
class of reaction. For ligases, the second number indicates the type of bond formed and the third
number the type of molecule in which it appears. For example, EC 6.1 is used for C-O bonds
(enzymes acylating tRNA), EC 6.2 for C-S bonds (acyl-CoA derivatives), etc. The fourth number
gives the specific enzymic activity.
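
Because the meaning of the second and third fields depends on the first, any program that interprets an EC number needs a per-class lookup. The sketch below is one hypothetical way to encode the description given above; the table `EC_CLASSES` and the function `describe_ec` are assumptions made for illustration, and the field descriptions are paraphrased from this section rather than taken from the Commission's own files.

```python
# Per-class meaning of fields 2 and 3, paraphrased from the text above.
EC_CLASSES = {
    1: ("Oxidoreductases", "substrate", "acceptor"),
    2: ("Transferases", "class of item transferred",
        "what is transferred, or in some cases the acceptor"),
    3: ("Hydrolases", "kind of bond cleaved", "molecular context"),
    4: ("Lyases", "kind of bond (e.g. C-C or C-O)", "specific molecular context"),
    5: ("Isomerases", "type of reaction", "specific class of reaction"),
    6: ("Ligases", "type of bond formed", "type of molecule in which it appears"),
}


def describe_ec(ec: str) -> str:
    """Return a one-line interpretation of an EC number such as '1.1.1.1'."""
    first, second, third, fourth = (int(f) for f in ec.split("."))
    name, second_meaning, third_meaning = EC_CLASSES[first]
    return (
        f"EC {ec}: class {first} ({name}); "
        f"field 2 = {second} ({second_meaning}); "
        f"field 3 = {third} ({third_meaning}); "
        f"field 4 = {fourth} (specific enzymic activity)"
    )


print(describe_ec("1.1.1.1"))
```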

Specialized classifications are available for some families of enzymes; for instance, the
MEROPS database by N. D. Rawlings and A. J. Barrett provides a structure-based classification
of peptidases and proteinases (see http://www.merops.sanger.ac.uk/).

4.3 Combined classification schemes

Rison et al. (2000) have compared functional classifications proposed for genomes. Most are
hierarchical, so that the authors could make an attempt to merge them into a 'combined
scheme', from which the various classifications could be compared. Of course the different
classifications are not entirely mutually consistent, requiring compromises in integrating them.
Their combined scheme is a three-level hierarchy. The top levels are:

(1) metabolism; 

(2) process; 

(3) transport; 

(4) structure and organization of structure; 

(5) information pathways; 

(6) miscellaneous. 

The intermediate and lower levels are increasingly more specific. However, in most cases even
the lower level is fairly general; for instance, in the combined scheme of Rison et al. (2000), entry
1.3.1 corresponds to metabolism/small molecules/amino-acid metabolism.

Rison et al. (2000) map different functional classifications onto their combined scheme and
compare coverage. Some gaps are implicit in the design of individual databases. For instance,
functions in the general class 'structure' are absent from KEGG - The Kyoto Encyclopaedia of
Genes and Genomes (Kanehisa et al. 2002) - leaving large gaps in its mapping onto the
combined scheme. Some other gaps arise from problems in mapping individual functional
classifications onto the combined scheme.

Even this combined scheme does not solve the problem of mapping functions to the level of 
detail desired for protein annotation. The authors recognize that some of the schemes treated 
have much higher functional resolution than theirs, but do not integrate that information. They 
mention but do not treat the EC classification. 

Given the goal of mapping a functional classification onto sequence and structure
classifications, several problems associated with current functional categorizations are generally
recognized. One is that the function is defined without reference to homology in general and
structure in particular. The EC, for instance, merges non-homologous enzymes that catalyse
similar reactions.

Gerlt & Babbitt (2001), who are among the most thoughtful writers on the subject, pointed
out that 'no structurally contextual definitions of enzyme function exist'. They propose a general
hierarchical classification of function better integrated with sequence and structure. For enzymes
they define:

• Family. Homologous enzymes that catalyse the same reaction (same mechanism, same
substrate specificity). These can be difficult to detect at the sequence level if the sequence
similarity becomes very low.

• Superfamily. Homologous enzymes catalysing either (a) similar reactions with different specificity
or (b) different overall reactions with common mechanistic attributes (partial reaction,
transition state, intermediate) that share conserved active-site residues.

• Suprafamily. Homologous enzymes catalysing different reactions with no common feature.
Proteins belonging to the same suprafamily would not be expected to be detectable from
sequence information alone.
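
The three levels can be read as a decision rule applied to a pair of enzymes already judged to be homologous. The following sketch is only an interpretation of the definitions listed above; the names `Level` and `relationship_level`, and the reduction of each definition to two yes/no questions, are assumptions made for illustration.

```python
from enum import Enum


class Level(Enum):
    FAMILY = "family"
    SUPERFAMILY = "superfamily"
    SUPRAFAMILY = "suprafamily"


def relationship_level(same_reaction: bool, shared_mechanistic_attribute: bool) -> Level:
    """Classify a pair of homologous enzymes.

    same_reaction: same mechanism and same substrate specificity.
    shared_mechanistic_attribute: a similar reaction with different specificity,
        or a common partial reaction, transition state or intermediate.
    """
    if same_reaction:
        return Level.FAMILY
    if shared_mechanistic_attribute:
        return Level.SUPERFAMILY
    return Level.SUPRAFAMILY


print(relationship_level(False, True).value)  # superfamily
```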

Another problem, that we have already mentioned, is that the traditional biochemist's view of
function arises from the study of isolated proteins in dilute solutions, in the presence of carefully
controlled concentrations of substrates. The molecular biologist knows that an adequate
definition of function must recognize the biological role of a molecule in the living context of a cell
(or intracellular compartment) or the complete organism on the one hand, and its role in
a network of metabolic or control processes on the other (Lan et al. 2002, 2003). (In addition to
the fundamental point of providing a more appropriate definition of function, information about
context is often useful in assigning function.) As a result, there is a generic problem with all
attempts to force functional classifications into a hierarchical format (see comments of Riley
1998).

4.4 The Gene Ontology Consortium

A more general approach to the logical structure of a functional classification has been adopted by
the Gene Ontology Consortium (2000) (see http://www.geneontology.org). Its goal is a
systematic attempt to classify function, by creating a dictionary of terms and their relationships for
describing molecular functions, biological processes and cellular context of proteins and other
gene products. It supports annotation efforts by providing a set of terms that individual
annotators or databases may adopt. (By an ontology they mean a set of well-defined terms with
well-defined inter-relationships; that is, a dictionary and rules of syntax.)

Organizing concepts of the gene ontology project include the distinctions between: 

• Molecular function. A function associated with what an individual protein or RNA molecule does
in itself; either a general description such as 'enzyme', or a specific one such as 'alcohol
dehydrogenase'. This is function from the biochemists' point of view.

and

• Biological process. A component of the activities of a living system, mediated by a protein or
RNA, possibly in concert with other proteins or RNA molecules; either a general term such as
signal transduction, or a particular one such as cyclic AMP synthesis. This is function from the
cell's point of view.

Because many processes are dependent on location, gene ontology also tracks:

• Cellular component. The assignment of site of activity or partners; this can be a general term such
as nucleus or a specific one such as ribosome.

An example of the gene ontology classification is shown in Fig. 5. Note that it is more general 
than a hierarchy. We feel that of the schemes for classification of function that have been 
proposed, only that of the Gene Ontology Consortium has the possibility of linkage to successful 
tests of prediction of protein function. 
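
The remark that the gene ontology is 'more general than a hierarchy' means that a term may have more than one parent, so the terms form a directed acyclic graph rather than a tree. The sketch below illustrates the consequence for annotation: a gene product annotated with one term implicitly inherits every ancestor of that term. The term names and parent links in `PARENTS` are a hypothetical toy example and are not copied from the Gene Ontology itself.

```python
from typing import Dict, List, Set

# Hypothetical toy ontology: each term maps to its list of parent terms.
PARENTS: Dict[str, List[str]] = {
    "DNA metabolism": [],
    "DNA replication": ["DNA metabolism"],
    "DNA repair": ["DNA metabolism"],
    "DNA-dependent DNA replication": ["DNA replication"],
    "DNA ligation": ["DNA replication", "DNA repair"],  # two parents: not a tree
}


def ancestors(term: str) -> Set[str]:
    """Return every term reachable by following parent links upwards."""
    found: Set[str] = set()
    stack = list(PARENTS[term])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS[parent])
    return found


print(sorted(ancestors("DNA ligation")))
# ['DNA metabolism', 'DNA repair', 'DNA replication']
```

In a strict hierarchy the set of ancestors would be a single path to the root; here 'DNA ligation' reaches the root by two routes.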



5. Methods for assigning protein function 

5.1 Detection of protein homology from sequence, and its application to function assignment

If there is a standard method for predicting protein function, it is the detection of similarity of
amino-acid sequence by database searching, and assuming that the molecules identified are
homologues with similar functions. Search engines such as PSI-BLAST pull out sequences
similar to a query sequence, from general protein sequence databases. The most favourable result
is to find that the query sequence is identical or very closely related to that of a well-characterized
protein. However, as we have seen, even in these cases the assignment of function may not be
correct or complete. The problem of assigning function becomes significantly more complex as
the similarity between the unknown sequence and its (putative) homologue falls, except that in
some cases specific sequence signature patterns identify active sites, even in proteins with little
overall sequence similarity to homologues of known function. Although the hope is that highly
similar proteins will share similar functions, substitution of a single, critically placed amino acid in
an active site may be sufficient to alter a protein's role fundamentally.

Fig. 5. The Gene Ontology Consortium classification of functions involving DNA metabolism. (From the
Gene Ontology Consortium, 2000; reproduced with permission.)

Several groups have tested correlations between sequence similarity and functional similarity.
One senses a feeling, in the relevant scientific community, that can be roughly stated as: 'Yes,
we know the collections of horror stories about proteins with very closely related sequences
but different functions, but those are rare exceptions, and the inference of function from
similarity in sequence works fairly well most of the time.' Does the evidence support this
assumption?

Shah & Hunter (1997) determined the sequence similarity of proteins within any EC class. 
They used a sample of 1327 classes and 15 208 proteins, and tested various similarity thresholds. 
Their conclusions were that the errors were dominated by false positives, and that it would be 
better to carry out this kind of analysis at the domain level. 

Wilson et al. (2000), Todd et al. (2001) and Devos & Valencia (2000) reached similar (although
not identical) optimistic conclusions.

Wilson et al. (2000) conclude that for pairs of single-domain proteins, at levels of sequence
identity ≥40%, precise function is conserved, and for levels of sequence identity ≥25% broad
functional class is conserved [according to a functional classification that uses the EC hierarchy
for enzymes, and supplements it with material from FLYBASE (Ashburner & Drysdale, 1994) for
non-enzymes]. Todd et al. (2001) found that for pairs of proteins, both known to be enzymes,
slightly <90% of pairs with sequence identity ≥40% conserve all four EC numbers. Even
at >30% sequence identity they found conservation of three levels of the EC hierarchy
for 70% of homologous pairs of enzymes. Devos & Valencia (2000) reached very similar
conclusions; they also reported the ability to predict correctly the agreement of FSSP categories
(Holm & Sander, 1999) and SWISS-PROT keywords, as a function of the level of sequence
similarity.
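
The thresholds of roughly 40% and 25% identity quoted above from Wilson et al. (2000) can be turned into a rule of thumb once percentage identity has been computed from a pairwise alignment. The sketch below is an illustration only: the function names are assumptions, the example sequences are invented, and identity is taken over aligned non-gap columns, which is one of several conventions in use. As the studies discussed next make clear, such a rule is meaningful only when the alignment covers a large enough part of both sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of aligned (non-gap) columns holding identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def expected_conservation(identity: float) -> str:
    """Rule of thumb based on the approximate 40% and 25% thresholds."""
    if identity >= 40.0:
        return "precise function"
    if identity >= 25.0:
        return "broad functional class"
    return "no inference"


# Invented toy alignment; '-' marks a gap.
identity = percent_identity("MKTAYIAKQR", "MKTAHIAK-R")
print(round(identity, 1), expected_conservation(identity))
```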

Rost (2002), using a wider definition of pairs of sequences identified for comparison -
including shorter matching regions - reached more pessimistic conclusions, entirely at variance
with those of other investigators. Of pairs of enzymes with >50% sequence identity, he
reported that <30% have entirely identical EC numbers. BLAST E values below 10⁻⁵⁰ were also
not sufficient to imply identical function. It should be noted that pairs of proteins with
>50% sequence similarity are expected to have very similar overall structures, <1 Å
root-mean-square deviation of over 95% of their backbone atoms, and the active sites may be even
more similar in structure (Chothia & Lesk, 1986). Even for pairs of proteins with over 70%
residue identity in the optimal alignment (a very close relationship indeed), over 30% do not even
share the first EC number, that is, the general classification! The implication is that to reason
successfully from sequence similarity to common function, it is essential to require that the
similarity extend over a large enough sector of the sequence, as in the studies of Wilson et al.
(2000), Todd et al. (2001) and Devos & Valencia (2000).

Function prediction from sequence similarity can take advantage of multiple sources of
information to back up the prediction from levels of sequence identity alone, and to improve the
results in cases of lower sequence similarity than the ~40% identity confidence threshold
proposed by Wilson et al. (2000), Todd et al. (2001) and Devos & Valencia (2000).

Having identified putative homologues, multiple sequence alignments enable identification of
conserved residues, the literature may provide crucial information about the family as a whole
and the role of conserved residues, and phylogenetic trees can provide information as to whether
an unknown protein clusters with a particular functional grouping (Hannenhalli & Russell, 2000;
Gu & Vander Velden, 2002). In general, if an unknown protein shares significant sequence
similarity with a family of known function, and possesses the 'right' essential conserved residues
(e.g. active-site residues), then a prediction as to function (proteinase, exonuclease, etc.) can
reasonably be proposed. In addition, if the unknown also forms part of a well-supported
functional cluster or clade within a phylogenetic tree then a more detailed level of functional
prediction may be possible.

Hannenhalli & Russell (2000) examined nucleotidyl cyclases. Changing the specificity between
an ATP cyclase and a GTP cyclase requires mutations of only two residues, E937K and C1018D.
From a common alignment of ATP and GTP cyclases, they were able to identify residues
correlated with the change in specificity, including the two crucial positions. Given the sequence
of a new enzyme in this family, it could be identified as a family member by overall sequence
similarity, and its specificity could be inferred from the residues occupying the selected positions.
Hannenhalli & Russell (2000) also showed that a similar analysis permitted prediction of
specificity of protein kinases [motifs were already known that were able to distinguish Ser/Thr from
Tyr kinases (Hanks & Hunter, 1995; Hanks et al. 1988)].
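
The idea of finding alignment positions correlated with specificity can be illustrated with a deliberately simplified criterion: report the columns that are invariant within each functional group but differ between groups. This is a sketch under that stated assumption and is not the statistical method of Hannenhalli & Russell (2000); the function name and the toy sequences are invented for illustration and are not real cyclase sequences.

```python
from typing import Dict, List


def discriminating_columns(groups: Dict[str, List[str]]) -> List[int]:
    """Return 0-based alignment columns that are invariant within each
    group but carry a different residue in every group."""
    lengths = {len(seq) for seqs in groups.values() for seq in seqs}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    columns = []
    for i in range(lengths.pop()):
        residues = []
        for seqs in groups.values():
            column = {seq[i] for seq in seqs}
            if len(column) != 1:
                break  # variable within a group: not diagnostic
            residues.append(column.pop())
        else:
            if len(set(residues)) == len(groups):
                columns.append(i)
    return columns


# Hypothetical toy alignment grouped by known specificity.
alignment = {
    "ATP": ["GAKLDT", "GSKLDT", "GAKIDT"],
    "GTP": ["GAELCT", "GSELCT", "GAEICT"],
}
print(discriminating_columns(alignment))  # [2, 4]
```

Given a new family member, the residues found at the reported columns could then be compared with those of each group to suggest its specificity.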

As a control, an illustration of a negative inference: an evolutionary tree of myotubularin-related
proteins permitted Nandurkar et al. (2001) to infer that their protein, although related to
active phosphatases, lacked the essential catalytic residues and acts as an adapter rather than an
enzyme.

Even in the event of a smooth path to successful prediction as outlined above, more questions
may be raised than answered. Let us consider an example where we are able to identify an
'unknown' protein as a proteinase through sequence similarity. Immediately the question arises
as to the target of the proteinase (i.e. the physiological substrate), and in addition, what (if any) are
its physiological inhibitor(s) or binding partner(s). It may be possible (if a representative structure
has been determined) to build molecular models of the unknown proteinase and make basic
predictions regarding substrate specificity by examining the nature of the residues lining the
predicted S1 subsite. However, this is a far cry from being able to predict accurately the physiological
substrate(s), and thus the biological function. Similar problems exist when attempting to
annotate functionally unknown proteins that belong to protein families the primary role of which is to
bind other proteins or small molecules - often it is difficult to predict the nature of the binding
partner. Thus it appears that relatively straightforward function prediction problems can get
bogged down relatively early by questions difficult to answer by common tools of bioinformatics.
Nevertheless, even the basic prediction that an unknown protein is a proteinase is valuable
information that may guide and accelerate experimental study.

More sensitive database searching engines such as PSI-BLAST and SAM3.0 and other
algorithms utilizing profile hidden Markov models (HMMs) allow identification of putative distant
homologies. Often these engines are able to detect such similarity in spite of extremely low
primary sequence identity (well below the twilight zone, 10-25% sequence identity, into the
midnight zone, below 10%). At this level of similarity it is crucial to be able to judge whether or
not a match is real and various methods are used to minimize the number of false positives.

Aravind & Koonin (1999) argue that the sequences picked up by sequence similarity searches
represent genuine homologues, on the grounds that current sequence search methods do not
pick up even all the proteins known (from structural and other considerations) to be genuine
homologues. Of course this is a comment on the state of current sequence searching techniques
and the recommended threshold values applied in their use. It may be that more powerful
sequence similarity detection programs will in the future pick up sequences that fold into similar
structures but are related by convergence rather than homology. The conclusion is that it is
important to keep recalibrating the methods in use and, paradoxically, as they grow more
powerful, to become more cautious in interpreting their results.

If even close relatives often do not share functions, does the identification of distant putative
homologues facilitate functional prediction, or is it a fruitless pursuit? Again, the answer to this
question depends on the value placed on a particular threshold of statistical significance in
accepting an inference. At best identification of distant similarity to a protein family of known
function may suggest a function and allow identification of active-site residues and assignment of
the unknown to a general functional class. Even if the relationship is genuine, the unknown may
have evolved far from its putative distant homologue. Although it is likely that some general
aspects of the mechanism may be common to an unknown and a distant homologue (particularly
likely if active-site residues are retained), it is quite possible that fundamental changes in the
nature of the substrate may have occurred (e.g. from lipid phosphate to DNA in the case of
the AP endonucleases).

Detection of homologues may provide one or more relatives for which the 3D structure is
known. This provides another level of information and another test of the prediction. Such a
match with a protein of known structure enables a molecular model to be built. Although if the
sequence similarity is low the quality of the model may also be low, even an approximate model
may allow the compatibility of the unknown sequence with the fold to be assessed (Schonbrun
et al. 2002). Furthermore, because the active sites of enzymes often comprise the most highly
conserved and structurally similar regions it may be possible to build a surprisingly detailed
model around the active site, even if overall sequence similarity is low. The two examples given
in the introduction from the structural genomics of Haemophilus influenzae illustrate the experience
that this approach sometimes works and sometimes does not.

Even if the results of an experimental structure determination are not available, theoretical
methods of structure prediction may be useful in identifying putative remote homology
(Schonbrun et al. 2002; Tramontano, 2003; Kinch et al. in press).

The situation for multi-domain proteins is even more complex. Although it may be relatively 
straightforward to predict the role of some of the domains using the methods described, others 
may prove more challenging. Thus a complete functional description of a multi-domain protein 
of unknown function may be limited if it contains one or more domains that cannot be accu- 
rately annotated. Furthermore the possibility of domains acting in concert with one another to 
modulate the behaviour of the complete molecule is difficult to predict.

5.2 Detection of structural similarity, protein structure classifications, and structure/function 
correlations 

It is well known that structure changes more conservatively than sequence during evolution.
There are many cases of distantly related homologues assignable from shared structures with no
recognizable relationship between the sequences. The 3D analogue of sequence alignment is
alignment by structural analogy: establishment of correspondences between pairs of residues
that occupy the same geometric positions in two protein structures. Many algorithms have been
implemented for this task (reviewed by Koehl, 2001).

DALI (Holm & Sander, 1993) is based on the observation that inter-residue contact patterns
are among the best preserved features of protein structures (Lesk & Chothia, 1980). The DALI
web server (see http://www.ebi.ac.uk/dali/) will screen a novel protein structure against the
Protein Data Bank and report the most similar structures and the alignment of the sequences.
DALI is used routinely by X-ray crystallographers and NMR spectroscopists to provide a
preliminary classification of each new structure.
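
To make the notion of an inter-residue contact pattern concrete, the sketch below builds a contact map from a list of Cα coordinates and measures the overlap between two maps. It is not the DALI algorithm: it assumes the two chains are already aligned residue for residue, whereas finding that correspondence is the hard part that DALI solves. The function names, the 6 Å cutoff and the coordinates are assumptions chosen for illustration.

```python
import math
from typing import List, Set, Tuple

Coord = Tuple[float, float, float]


def contact_map(coords: List[Coord], cutoff: float = 6.0) -> Set[Tuple[int, int]]:
    """Pairs (i, j), j >= i + 2, of residues closer than the cutoff (in Å)."""
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):  # skip sequence neighbours
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts


def contact_overlap(a: Set[Tuple[int, int]], b: Set[Tuple[int, int]]) -> float:
    """Jaccard overlap of two contact maps (1.0 means identical patterns)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Invented coordinates: a hairpin-like chain, a translated copy, and a straight chain.
hairpin = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
           (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
shifted = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in hairpin]
extended = [(3.8 * i, 0.0, 0.0) for i in range(6)]

print(contact_overlap(contact_map(hairpin), contact_map(shifted)))
print(contact_overlap(contact_map(hairpin), contact_map(extended)))
```

Because the map depends only on internal distances, moving the molecule leaves it unchanged, while a different fold gives a different pattern. This is why contact patterns are a convenient basis for comparing structures without first superposing them.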

Several authors have applied the known structures to infer homology among proteins
too distantly related to be identified as homologues from the sequences alone. They have
created databases that merge structures and sequences, exploiting the greater reliability of
homology detection and alignment attainable by use of structural information (Holm & Sander,
1999; Przytycka et al. 1999; Aloy et al. 2002).

A hierarchical structural classification of protein domains of known structure, based on the
DALI program, is available on the web (Holm & Sander, 1999). Two other major databases of
classifications of protein structures are the Structural Classification of Proteins (SCOP) (Murzin
et al. 1995; Lo Conte et al. 2002) and CATH (Pearl et al. 2003). There are many others, tabulated
in Ouzounis et al. (2003). SCOP depends crucially on manual curation by A. G. Murzin. CATH is
based on a structural-alignment program, SSAP (Taylor & Orengo, 1989). Most classification
schemes for sequences and structures are expressed as hierarchical clusterings. The most similar
items are grouped together at the lowest level. The sets of linked items are progressively merged
to form successive levels of the hierarchy. For instance, the SCOP database has as its basis
individual domains of proteins. Sets of domains are grouped into families of homologues, for
which the similarities in structure, sequence, and sometimes function imply a common
evolutionary origin. Families containing proteins of similar structure and function, but for which
the evidence for evolutionary relationship is suggestive but not compelling, form superfamilies.
Superfamilies that share a common folding topology, for at least a large central portion of
the structure, are grouped as folds. Finally, each fold group falls into one of the general classes.
The major classes in SCOP are α, β, α+β, α/β, multi-domain proteins, membrane and cell
surface proteins, and miscellaneous small proteins, which often have little secondary structure
and are held together by disulphide bridges or ligands.

Several groups have attempted to correlate protein structure and function (Hegyi & Gerstein,
1999; Thornton et al. 1999). Hegyi & Gerstein (1999) correlated the enzymes in the yeast genome
between their fold classification in SCOP (Lo Conte et al. 2002) and their EC functional
categories, via the annotations in SWISS-PROT. They identified 8937 single-domain proteins that
could be assigned both a fold and a function.

The broadest categories of structure were from the top of the SCOP hierarchy, including the
all-α, all-β, α/β, α+β, multi-domain, and small classes. The broadest categories of function
were from the top of the EC hierarchy: oxidoreductases, transferases, hydrolases, lyases,
isomerases and ligases; plus an additional category, non-enzymes. There are therefore 6 (structural
classes) × 7 (functional classes) = 42 possible combinations of highest-level correlates. By using
finer classifications of structure and function (down to the third level of EC numbers) there are a
total of 21 068 potential fold-function combinations. Only 331 of these are observed, among the
8937 proteins analysed.
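
The sparseness of the observed fold-function table follows directly from the figures quoted above. The short calculation below simply restates that arithmetic; the variable names are chosen for illustration.

```python
structural_classes = 6   # top-level SCOP classes considered
functional_classes = 7   # six EC classes plus non-enzymes
top_level_combinations = structural_classes * functional_classes

potential = 21068        # potential fold-function combinations at the finer level
observed = 331           # combinations actually observed
fraction_observed = observed / potential

print(top_level_combinations)       # 42
print(f"{fraction_observed:.1%}")   # 1.6%
```

Fewer than 2% of the potential combinations occur, which is the quantitative basis for describing the distribution as highly non-random.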

The observed distribution is highly non-random. Non-enzymic functions account for 59% of
the sequences, of which well over half are in the all-α or all-β fold category. Of the enzymes, the
most popular combinations were α/β folds among oxidoreductases and transferases, and all-β
and α+β hydrolases.

Knowing the structure of a domain, what can be inferred about its function? Many folds are
compatible with very different activities. The five most 'versatile' folds are the TIM barrel, α/β
hydrolase, the NAD-binding fold, the P-loop-containing NTP hydrolase fold, and the ferredoxin
fold. Conversely, the functions carried out by the most different types of structure are
glycosidases and carboxylases. These two functions are carried out by seven different fold types, from
three different fold classes.

What we are looking for, however, are cases where structure provides reliable clues to
function. In their cross table, Hegyi & Gerstein (1999) show several folds that appear in combination
with only one function. These appear to have predictive significance for function. Of course one
cannot tell whether this is just because they are rare folds, and whether the correlation will hold
up as the databases grow.

5.3 Function prediction from amino-acid sequence

Despite the progress in structural genomics projects, most proteins encoded in newly sequenced
genomes are known from their amino-acid sequences alone. A major problem in genome
annotation is that of assigning their functions. Note that not only are the 3D structures in most
cases unknown, there is generally no information even about cofactors or post-translational
modifications, which are often essential for function.

There are two basic approaches to prediction of protein function from amino-acid sequence
alone, focused on (1) overall sequence similarity and (2) signature patterns of active sites, or
motifs (Bork & Koonin, 1996). We have already discussed the standard method based on the
assumption that in at least many cases evolutionary divergence is slow enough to permit
recognition of homologues that may have the same or at least similar structures and functions. Often
the general similarity of sequences reflects a similarity in overall folding pattern, and particular
residues within the fold may form a localized active site. Clearly the conservation of active-site
residues is important in reasoning from sequence similarity to functional similarity. Indeed, in
some cases it is possible to cut short the reasoning and to recognize the residues comprising the
active site from a specific signature pattern or motif within the sequence. However, although
many motifs do reflect functional active sites, others reflect positions for post-translational
modification (e.g. glycosylation sites), or structural signals (e.g. N and C caps of α-helices), or
signal sequences, with no direct functional implications.

Attwood (2000) has described general methods for deducing sequence patterns. All start with 
(or produce) a multiple sequence alignment, and seek to identify common distinctive features of 
particular positions of the sequence. These features may involve: 

(1) A motif describing a single consecutive set of residues.

(2) Multiple motifs: a combination of several motifs involving separate consecutive sets of
residues.

(3) Profile methods, based on entire sequences and weighting different residue positions ac-
cording to the variability of their contents. Extensions and generalizations of profile
methods, including HMMs, are among the most sensitive detectors of distant homology
based entirely on sequence data that we have.


5.3.1 Databases of single motifs

Motifs may be expressed in terms of uniquely defined sequences, such as
GWTLNSAGYLLGP,

which characterizes the neuropeptide galanin. Or, motifs may contain alternative residues; for
instance [LIVMF]-T-T-P-P-[FY], the signature of N-4 cytosine-specific DNA methylases. Here
[LIVMF] means that the first position may contain any of the amino acids L, I, V, M, or F,
followed by the unique sequence TTPP, followed by a position that may contain either F or Y.
It is easy to indicate a site which excludes a specific amino acid by bracketing the other 19, or
by using the notation {P} to indicate 'any amino acid except proline'. Motifs can contain
'wild cards' (which permit any of the 20 amino acids at a position) and 'spacers'; for instance,
L-x(6)-L-x(6)-L-x(6)-L, the signature pattern of the leucine zipper which appears in some
eukaryotic transcription regulator proteins. The pattern specifies four leucines each separated by
six residues each of which may be any amino acid. More generally, a signature pattern may be
specified by a 'regular expression', which allows for a wider range of alternative patterns and
variable distances between residue positions. It is simplest to search for exact matches to the
patterns, but algorithms that allow for some mismatches are available (see e.g. Gusfield, 1997;
Crochemore & Rytter, 2003).
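
The correspondence between this notation and regular expressions can be made concrete. The following is a minimal sketch in Python, not the PROSITE retrieval software: the function name prosite_to_regex is hypothetical, and only the elements described above are handled (single residues, bracketed alternatives, braced exclusions, the wild card x, and repeat counts). It searches for exact matches only.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern such as L-x(6)-L into a Python regex.

    Hypothetical helper covering only the notation described in the text.
    """
    tokens = []
    for element in pattern.split("-"):
        # Split off an optional repeat count, e.g. x(6) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, low, high = match.groups()
        if core == "x":
            token = "."                        # wild card: any residue
        elif core.startswith("[") and core.endswith("]"):
            token = core                       # alternative residues
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"    # excluded residues
        elif len(core) == 1 and core.isalpha():
            token = core                       # one required residue
        else:
            raise ValueError(f"unsupported element {element!r}")
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        tokens.append(token)
    return "".join(tokens)

# The two signature patterns quoted in the text.
METHYLASE = re.compile(prosite_to_regex("[LIVMF]-T-T-P-P-[FY]"))
LEUCINE_ZIPPER = re.compile(prosite_to_regex("L-x(6)-L-x(6)-L-x(6)-L"))
```

A call such as METHYLASE.search(sequence) then returns the first exact match in a one-letter-code sequence, or None. Note that the excluded-residue translation accepts any character other than those listed, so the input is assumed to contain only valid amino-acid letters.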

Attempts to apply data mining techniques to pattern discovery in biological sequences are by
now a heavy industry with an enormous literature (see e.g. Floratos et al. 2001).

One very important set of results of this kind of work, PROSITE (Sigrist et al. 2002),
contains a collection of motifs covering a wide range of groups of proteins, together with
retrieval software to check a submitted sequence for the presence of one or more motifs.
The motifs are calibrated to indicate the number of false negatives and positives to be expected.
The [LIVMF]-T-T-P-P-[FY] motif detects all N-4 cytosine-specific DNA methylases, but
also picks up false positives. The L-x(6)-L-x(6)-L-x(6)-L motif is least specific, missing one
known leucine zipper (L-myc, which contains a methionine instead of one of the leucines)
and promiscuously picking up hundreds of other sequences from many different types
of proteins.

Thornton et al. (1999) have investigated the structural implications of conserved sequence
motifs. Typically these are involved in conserved substructures contributing to a common
function. Kasuya & Thornton (1999) have confirmed that PROSITE motifs reflect common 3D
structural patterns by analysis of protein structures in which they appear. Kasuya & Thornton
(1999) found examples among proteins of known structure of 553 of the 1265 PROSITE
patterns available at the time of their work. In most cases the residues matching a given PRO-
SITE pattern in different proteins had similar 3D structures as measured by the root-mean-
square deviation of the Cα atoms. Some of the exceptions observed are biologically interesting.
For instance, among the matches to the 12-residue TRYPSIN_SER pattern that includes the
active-site serine of the trypsin family of serine proteinases

[DNSTAGC]-[GSTAPIMVQH]-x(2)-G-[DE]-S-G-[GS]-[SAPHV]-[LIVMFYWH]
-[LIVMFYSTANQH]

outliers in conformation space included proenzymes, for which it is known that the region
matching the pattern undergoes conformational change upon activation.

Todd et al. (2001, 2002) have collected cases of homologous enzymes, some but not all of
which catalyse the same reaction, in which residues equivalent in their contribution to catalysis
appear at non-equivalent positions in the active site. Examples include human alcohol dehy-
drogenases in classes Iβ and IIπ, which have 62% overall residue identity in their sequence
alignment, but in which the active site Thr and His appear in different sequence patterns: 48Thr-
49Asp-50Asp-51His or 47His-48Thr; and β-lactamases in classes A and C, in which the catalytic
residues appear on different structural elements.

Several authors have sought to extend motif searching to three dimensions. Given that motifs
tend to correspond to regions of conserved structure linked to function, Wallace et al. (1996)
searched known protein structures for the Ser-His-Asp catalytic triad of trypsin-like serine
proteinases. They identified all known serine proteinases in their dataset, plus triacylglycerol lipases
which share the catalytic triad.

de Rinaldis et al. (1998) derived 3D profiles from a single protein structure or a set of aligned
structures. They applied their results to identifying proteins with matching surface patches.
Analysis of the 3D profiles of ATP and GTP binding P-loop proteins identified a positively
charged phosphate-binding residue (Arg or Lys) in a position conserved in space but not in
sequence.

In a similar approach, Jackson & Russell (2000, 2001) have identified regions with con- 
formations similar to those of PROSITE motifs, but not necessarily sharing sequence similarity 
with them. They were able to identify serine proteinase inhibitors that contain regions similar in 
conformation to the loops in known inhibitors that have a common structure that docks to the 
proteinase. 

5.3.2 Databases of profiles

Given a multiple sequence alignment, it is usually the case that some positions show high
variability while others show high conservation. To detect other sequences that share the pattern,
a weighted alignment of a target sequence with the alignment table can be carried out, giving
higher weight to matches at highly conserved positions. Profiles could alternatively be based on
the local regions of high conservation that went into motifs.
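
The idea of weighting matches by conservation can be illustrated with a deliberately simplified sketch in Python. It assumes an ungapped alignment, scores a target window by residue frequency at each column, and uses the frequency of the commonest residue as the column weight. These choices, and the function names, are assumptions made for illustration; production profile methods use substitution matrices, gap penalties, and sequence weighting.

```python
from collections import Counter

def build_profile(alignment):
    """Return, for each column, (residue frequencies, conservation weight)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be aligned to equal length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        freqs = {res: n / len(alignment) for res, n in counts.items()}
        weight = max(freqs.values())      # 1.0 for an invariant column
        profile.append((freqs, weight))
    return profile

def score_window(profile, window):
    """Weighted score of one ungapped window of the target sequence."""
    return sum(weight * freqs.get(res, 0.0)
               for (freqs, weight), res in zip(profile, window))

def best_match(profile, target):
    """Slide the profile along the target; return (best score, start index)."""
    width = len(profile)
    if len(target) < width:
        raise ValueError("target is shorter than the profile")
    return max((score_window(profile, target[i:i + width]), i)
               for i in range(len(target) - width + 1))
```

With this weighting, a mismatch at an invariant column costs more than a mismatch at a variable one, which is the behaviour described above.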

Associated with PROSITE is a compendium of profiles characterizing entire domains. Be-
cause the matching of such profiles is sensitive to the sequences of entire domains, it is less likely
to return false positives; but because the information contained in the most conserved part of
the sequences is eroded, it may lose sensitivity relative to motif matching.

An alternative approach to describing a set of homologous sequences is HMMs (Eddy, 1996).
HMMs represent successive positions in a probabilistic way. They are more general than simple
profiles, and do a better job of discriminating homologues from non-homologues, provided that
they are trained with correct alignments. HMMs currently provide the most sensitive methods
for detecting distant homologues given only the amino-acid sequence of a query protein.

Pfam is a database of multiple alignments of protein domains, and the HMMs built from them
(Bateman et al. 2002). Search software permits detection of whether a query sequence belongs to
any of the families in Pfam.

The Superfamily database is a library of HMMs for all proteins of known structure (Gough
et al. 2001). Its goal is to identify, from protein sequences, domains with folds corresponding to
one or more known structures.

5.3.3 Databases of multiple motifs 

We have pointed out that motifs may be more specific than profiles because they focus on well-
conserved active sites. But a weakness of single-motif patterns is that an active site of a protein
may be defined by regions that are distant in the sequence although nearby in space. Single-motif
patterns are also necessarily based on characteristics of single domains, whereas it may be useful
to identify proteins by the presence of more than one domain. Multiple-motif databases aim to
remedy these problems.

BLOCKS (Henikoff et al. 2000) and PRINTS (Attwood, 2002; Attwood et al. 2002a, 2003) are
databases of multiple motifs, typically ~20 residues long, presented in the form of ungapped
multiple sequence alignments. PRINTS but not BLOCKS contains biological documentation of
the significance of the motifs. Search software can identify matches to individual motifs in a
query sequence. There is flexibility in defining how many of the motifs must match, and to what
stringency, to constitute a 'hit'.
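
The two adjustable criteria, how many motifs must match and how stringently, can be sketched as follows. This is a toy illustration in Python and not the BLOCKS or PRINTS search software: each motif is reduced to a single consensus string, stringency is a maximum number of mismatches, and motifs are required to occur in order along the sequence. All names and the example motifs are hypothetical.

```python
def find_motif(seq, motif, max_mismatches, start=0):
    """First index >= start where motif matches within the mismatch limit, else -1."""
    for i in range(start, len(seq) - len(motif) + 1):
        mismatches = sum(a != b for a, b in zip(seq[i:i + len(motif)], motif))
        if mismatches <= max_mismatches:
            return i
    return -1

def fingerprint_hit(seq, motifs, min_motifs, max_mismatches=1):
    """Return (is_hit, number of motifs matched in order along seq)."""
    matched = 0
    position = 0
    for motif in motifs:
        i = find_motif(seq, motif, max_mismatches, position)
        if i >= 0:
            matched += 1
            position = i + len(motif)   # later motifs must lie downstream
    return matched >= min_motifs, matched

# Invented three-motif fingerprint and query, for illustration only.
TOY_FINGERPRINT = ["GDSGG", "HCWA", "NPLT"]
TOY_QUERY = "MMGDSGAKKKHCWAQQQ"
```

Raising min_motifs or lowering max_mismatches makes the definition of a hit stricter; relaxing either admits partial matches, which is what permits subsets of a family to be distinguished by the motifs they contain.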

If different sets of sequences match different individual motifs one has the additional possi-
bility of classifying subsets of a family of homologues, and inferring evolutionary trees. For
instance, Attwood (2001, 2002) has used motifs to classify the important family of G protein-
coupled receptors (GPCRs), a large family of cell-surface proteins that detect and signal hor-
mones and growth factors, and mediate the senses of sight and smell. Particular motivation for
classifying subtypes is the fact that GPCRs are common drug targets. Potential for improvements
in specificity would have important clinical consequences.

The PRINTS database contains a seven-motif fingerprint for GPCRs — each motif corre- 
sponding to one of the transmembrane helices. Additional sets of motifs identify subfamilies of 
GPCRs and receptor subtypes. Some but not all of these motifs overlap the general family 
fingerprint. Mapping of the motifs onto the structure of rhodopsin shows what structural 
features distinguish the subclasses (Attwood et al. 2002b).

To apply these databases to prediction of protein function, it should be kept in mind that
profiles or HMMs are sensitive to overall folding pattern, sometimes at the expense of focus on
specific active-site residues. Conversely, some motifs are sensitive to active-site residues but in
their insensitivity to features of the sequence as a whole may pick up non-homologous proteins
as false positives.

Among these classes of method, a combination of a profile and motif match would therefore
seem to be the most reliable criterion for function assignment (see Chen & Jeong, 2000).

Table 4. Some of the databases of protein family classifications

Database                  Contents                               Reference

Primarily sequence based
BLOCKS+                   Families                               Henikoff et al. (2000)
COG                       Families                               Tatusov et al. (2001)
HSSP                      Protein families including proteins    Holm & Sander (1999)
InterPro                  Families/domains                       Mulder et al. (2003)
Pfam                      HMM-based families                     Bateman et al. (2002)
PIMA                      Domains
PIR-ALN                   Domains, families, superfamilies       Srinivasarao et al. (1999)
PRINTS                    Families                               Attwood (2002)
iProClass                 Domains, families, superfamilies       Huang et al. (2003)
ProDom                    Domains                                Servant et al. (2002)
PROSITE                   Families                               Sigrist et al. (2002)
ProtoMap                  Families                               Yona et al. (2000)
PROT-FAM                  Domains, families, superfamilies       Mewes et al. (1997)
SBASE                     Domains of known structure             Vlahovicek et al. (2002)

Hierarchical protein structure classifications
SCOP                      Domains                                Lo Conte et al. (2002)
CATH                      Domains                                Orengo et al. (2002)
DALI domain dictionary    Domains                                Dietmann & Holm (2001)

5.3.4 Precompiled families

Several groups have applied tools for sequence matching to full sequence databases, or used
structural similarity, to classify proteins (Table 4). Note that the exact definitions of the cat-
egories vary among the databases.

InterPro is an umbrella database that attempts to integrate the contents, features, and anno- 
tation of several individual databases of protein families, domains, and functional sites (Mulder 
et al. 2003). It subsumes, but is not limited to, information from PROSITE, Pfam, PRINTS,
SMART and ProDom databases, and contains links to others including the Gene Ontology 
Consortium functional classification. It intends to assimilate additional databases, including 
structural databases. Resistance is futile. 

An InterPro entry is a description of a protein family, domain, repeat, or site of post-trans- 
lational modification, and links to other databanks, and original literature. Annotations from 
the source databases are merged. Each entry includes links to relevant terms from the Gene
Ontology Consortium classification schemes. 

5.3.5 Function identification from sequence by feature extraction

Although information about function must be contained implicitly in amino-acid sequences, it is
obscure. It can be seen that even using structure as an intermediate stepping-stone between
sequence and function does not satisfactorily resolve the problem. Brunak and his colleagues
have examined an alternative intermediate between sequence and function (Jensen et al. 2002).
They reasoned that information about function should be contained in a spectrum of features of
proteins, including secondary structure, post-translational modifications, protein sorting, and
general properties of the amino-acid composition such as the isoelectric point. Using neural
networks they predicted the following features from protein sequences, and correlated the results
with functional classes:

• extinction coefficient;
• grand average hydrophobicity;
• number of negative residues;
• number of positive residues;
• O-glycosylation;
• serine/threonine phosphorylation;
• tyrosine phosphorylation;
• N-glycosylation;
• PEST-rich regions;
• secondary structure;
• subcellular location;
• low complexity regions;
• signal peptides;
• transmembrane helices.

They recognized that the predictions of the features would be imperfect, but this need not fatally 
degrade their prediction of function. 

The combined networks were trained to recognize a general set of functional classes based on
categories originally defined by Riley (1993), and, within the proteins predicted to be enzymes,
the EC classification. As a measure of the quality of the results, for the general categories, at a
level of thresholding giving 70% correct predictions, the range of false positives varied from
below 10% to below 40%, with most categories giving about 20% false positives. (A sensitivity
of 70% with 20% false positives means that if a large number of novel sequences are submitted to
the procedure, and this set of sequences contains 100 examples of proteins in some functional
class, the network will report that 90 of the proteins are in that functional class; 70 of the
predictions will be correct and 20 will correspond to proteins outside the functional class.)
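
The arithmetic of this example can be written out explicitly. The short Python sketch below follows the convention implied by the worked example, namely that the false-positive figure is expressed relative to the number of true members of the class; that reading is an assumption, as is the function name. It also reports the fraction of reported predictions that are correct, which is not quoted in the text.

```python
def prediction_counts(n_in_class, sensitivity, false_positive_fraction):
    """Counts implied by a sensitivity and a false-positive figure.

    Assumption: false_positive_fraction is relative to n_in_class,
    as in the worked example (100 true members -> 20 false positives).
    """
    true_pos = round(n_in_class * sensitivity)
    false_pos = round(n_in_class * false_positive_fraction)
    reported = true_pos + false_pos
    return {
        "true_positives": true_pos,
        "false_positives": false_pos,
        "reported": reported,
        "missed": n_in_class - true_pos,
        "fraction_of_reports_correct": true_pos / reported if reported else 0.0,
    }

example = prediction_counts(100, 0.70, 0.20)
```

For the figures in the text this gives 70 correct and 20 incorrect predictions among 90 reported, with 30 true members of the class missed, so that about 78% of the reported assignments are correct.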

By analysing the networks, Jensen et al (2002) were also able to analyse which particular 
combinations of features were the most effective signals for specific functional types. 



5.4 Methods making use of structural data

Several groups have developed methods to apply structural information, in most cases in com-
bination with sequence information, to interpret function.

Shapiro & Harris (2000) and Teichmann et al. (2001a) illustrate the power of structure,
including but not limited to identifying distant relationships not derivable from sequence com-
parisons.

(1) Identification of structural relationships unanticipated from sequence can suggest similarity
of function. The crystal structure of AdipoQ, a protein secreted from adipocytes, showed
a similarity of folding pattern to that of tumour necrosis factor. The inference that AdipoQ
is a cell-signalling protein was subsequently verified.

(2) The histidine triad proteins are a broad family with no known function. Analysis of their
structures indicated a catalytic centre and nucleotide-binding site, identifying them as
nucleotide hydrolases. Note that this did not depend on detection of a distant homology.

(3) Structural similarity of a gene product of unknown function from Methanococcus jannaschii and
other proteins containing nucleotide-binding domains led to experiments showing it to be a
xanthine or inosine triphosphatase (Hwang et al. 1999).

Like most sequence-based methods, these structure-based methods proceed by searching for
homologies. It is well known that distant homology is frequently more easily detectable in
structure than in sequence. However, one must recognize that the more distant the relationship,
the less reliable the inference of common function. In general, structure does not permit un-
ambiguous assignment of a precise function, but can provide guidance to experiments that can
do so.

Several groups have attempted to determine the common functionally active site of a family 
of proteins. Lichtarge et al. (1996a) have developed an evolutionary trace method to define bind- 
ing surfaces common to protein families. They extract functionally important residues from 
sequence conservation patterns and map them onto the protein surface to identify functional 
clusters. 

Given a set of homologous sequences, and at least one structure, the goal of the evolutionary 
trace method is to identify surface sites implicated in function. The assumptions of the 
method are: 

(1) The set of proteins has a common surface-exposed active site. 

(2) The homologous sequences produce similar structures, that retain the location in molecular
space of the active site.

(3) The functional site is less subject to mutation than average surface sites.

(4) Those mutations in the functional site that do occur are not random but create discrete sets
of structures with shifts in function (see also Golding & Dean, 1998; Gu, 1999).

The method begins by forming a multiple sequence alignment, from which the molecules are
hierarchically clustered into a tree. By choosing different levels in the hierarchy, clusters of
different size may be extracted. If different functions are known in the family, the clusters are
chosen to reflect subgroups with different function. By choosing larger or smaller clusters,
grosser or finer resolution in function distinction may be made. For each cluster in the partition,
form a consensus sequence alignment. Then co-align all the consensus sequences. The residues
can be divided into (a) those that are absolutely conserved, (b) those that are conserved within
clusters but differ between clusters, and (c) unconserved positions. By mapping the conserved
residues onto the structure, a pattern is observed that defines a surface patch predicted to
correspond to the active site.
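
The three-way division of positions can be sketched in a few lines of Python. This is an illustration of the classification step only, under simplifying assumptions: the sequences are already aligned, the partition into clusters (the cut through the tree) is supplied by the caller, and gap characters are treated like any other symbol. The function names and the toy alignment are hypothetical, and the mapping onto a structure is not attempted.

```python
def cluster_consensus(seqs):
    """Consensus of one cluster: the residue where invariant, else None."""
    return [col[0] if len(set(col)) == 1 else None for col in zip(*seqs)]

def trace(clusters):
    """Label each alignment column given a dict of cluster name -> aligned sequences."""
    consensuses = [cluster_consensus(seqs) for seqs in clusters.values()]
    labels = []
    for column in zip(*consensuses):
        if None in column:
            labels.append("variable")        # (c) varies within some cluster
        elif len(set(column)) == 1:
            labels.append("conserved")       # (a) identical in every cluster
        else:
            labels.append("class-specific")  # (b) fixed within, differs between
    return labels

# Invented two-cluster alignment, for illustration only.
TOY_CLUSTERS = {
    "subfamily_A": ["GKSTL", "GKSAL"],
    "subfamily_B": ["GRSVL", "GRSIL"],
}
```

Columns labelled conserved or class-specific are the ones that would then be mapped onto the known structure to look for a contiguous surface patch.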

Lichtarge et al. (1996a) applied their method to SH2 and SH3 signalling domains, and the
DNA-binding domain of nuclear hormone receptors. Their results correctly identified the
known functional sites in these molecules. If the evolutionary trace method depended on a
classification induced by known functional divergence, as in these test cases, it would be arguable
that it was really a method for assigning structure to function rather than function to structure.
However, it can be applied using trees from other sources, and the classifications they induce; for
instance, those based solely on multiple sequence alignments.

Successful predictions by the evolutionary trace method include identification of the func-
tional surface in families of G protein α-subunits (Lichtarge et al. 1996b) and regulators of G
protein signalling (Sowa et al. 2000, 2001). Both cases were blind predictions subsequently verified
by experiment. The success of the evolutionary trace method has led to its being taken up and
developed by a number of groups (Aloy et al. 2001; Lichtarge & Sowa, 2002; Madabushi et al.
2002; Yao et al. 2003).

Irving et al. (2001) applied the idea that active sites tend to be among the structurally best
conserved parts of a protein, by using superposition methods to extract regions of the lowest
root-mean-square deviation of Cα atoms in a pair of proteins of known structure. They tested
the method on a pair of proteins - YabJ from B. subtilis (PDB entry 1qd9) and YjgF from E. coli
(1qu9) - related to chorismate mutase. Without using any information from chorismate mutase,
their program suggested that YabJ and YjgF share an active site, which occupies a similar region
of their structures as the active site of chorismate mutase.

It should be emphasized that identification of an active site is not per se an identification of
function, but an important step towards one. Once a binding site is targeted, the identification
of a ligand is computationally the same problem faced in drug design, for which many
mature algorithms and software packages exist (Finn & Kavraki, 1999).

Moreover, the mode of binding of a ligand does not always correlate with sequence or 
structural similarity. Cappello et al. (2002) studied the mode of binding of the adenine ring in
different proteins. Their conclusion was that proteins with similar folds can bind adenine in 
different ways, and (interesting but less relevant for possible methods for function prediction) 
proteins with dissimilar structures and functions can bind adenine in similar ways. 

6. Applications of full-organism information: inferences from genomic
context and protein interaction patterns

For proteins encoded in complete genomes, approaches to function prediction making use of
contextual information and intergenomic comparisons are useful (Marcotte et al. 1999; Huynen
et al. 2000; Huynen & Snel, 2000; Kolesov et al. 2001, 2002).

(1) Gene fusion. A composite gene in one genome may correspond to separate genes in other
genomes. The implication is that there is a relationship between the functions of these genes.

(2) Local gene context. It makes sense to co-regulate and co-transcribe components of a pathway.
In bacteria, genes in a single operon are usually functionally linked.

(3) Interaction patterns. As part of the development of full-organism methods of investigation, data
are becoming available on patterns of protein interactions (Xenarios et al. 2002). The network
of interactions reveals the function of a protein.

(4) Phylogenetic profiles. Pellegrini et al. (1999) have exploited the idea that proteins in a common
structural complex or pathway are functionally linked and expected to co-evolve. For each
protein encoded in a known genome, they construct a phylogenetic profile that indicates
which organisms contain a homologue of the protein in question. Clustering the profiles
identifies sets of proteins that co-occur in the same group of organisms. Some relationship
between their functions is expected.

For instance, E. coli ribosomal protein RL7 has homologues in 10 out of 11 eubacterial
genomes, but no homologue appears in an archaeal genome (Pellegrini et al. 1999). Most of the
E. coli proteins that share the phylogenetic profile of RL7 have ribosome-associated functions.
If the function of RL7 were unknown one could infer that it is associated in some way with the
ribosome. Comparison of keywords in SWISS-PROT annotations affords a general test of
this approach. Sets of non-homologous proteins with similar phylogenetic profiles had, on
average, 18% of SWISS-PROT keywords in common.
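
The construction and comparison of phylogenetic profiles can be sketched as follows. This Python fragment is an illustration only: it represents each profile as a tuple of 0/1 flags over an ordered list of genomes, groups proteins whose profiles are identical, and offers a Hamming distance for comparing profiles that are similar but not identical. The function names, genome labels, and presence data are invented, and the detection of homologues in each genome is assumed to have been done already.

```python
def phylogenetic_profile(present_in, genomes):
    """Tuple of 0/1 flags: does each genome contain a homologue?"""
    return tuple(int(genome in present_in) for genome in genomes)

def group_identical_profiles(presence, genomes):
    """Map each distinct profile to the list of proteins that share it."""
    groups = {}
    for protein, present_in in presence.items():
        profile = phylogenetic_profile(present_in, genomes)
        groups.setdefault(profile, []).append(protein)
    return groups

def hamming(profile_a, profile_b):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(profile_a, profile_b))

# Invented data: four genomes and three proteins, for illustration only.
TOY_GENOMES = ["genome1", "genome2", "genome3", "genome4"]
TOY_PRESENCE = {
    "proteinA": {"genome1", "genome2", "genome3"},
    "proteinB": {"genome1", "genome2", "genome3"},
    "proteinC": {"genome4"},
}
TOY_GROUPS = group_identical_profiles(TOY_PRESENCE, TOY_GENOMES)
```

In this toy case proteinA and proteinB share a profile and would be predicted to be functionally linked, whatever their sequences; proteinC, with a different distribution, would not.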

There need be no sequence or structural similarity between the proteins that share a phylo- 
genetic distribution pattern. One unusual and very welcome feature of this method is that it is 
one of the few that derives information about the function of a protein from its relationship to 
non-homologous proteins (Marcotte et al. 1999; Pellegrini et al. 1999).



7. Conclusions

The problem of prediction of function from amino-acid sequence and protein structure is far 
from being satisfactorily solved. 

Some problems are hard only because they are difficult; others are hard because they are both
difficult and messy. The prediction of protein structure from amino-acid sequence is difficult, but
we know that nature has an algorithm and all we have to do is find it, and given any procedure we
can easily decide whether the answer is correct or not. The prediction of protein function is
messy, partly because function is a fuzzy and multi-faceted concept, and partly because very small
(or even no) changes in amino-acid sequence are compatible with large changes in function.

It appears that the most general classification of function is that produced by the Gene 
Ontology Consortium. Their results have the advantage of being appropriate to both biochem- 
istry and biology, at the expense of greater logical complexity. 

Many of the methods that have been applied to function prediction work part of the time but
none is perfect. Moreover, the more expert the analysis of the results applied, the better the
predictions are. This makes it difficult to envisage a purely 'black-box' automatic annotation
machine for new whole-genome sequences. In most cases predictions suggest, but do not
determine, the general class of function. Their most useful effect is to guide investigations in the
laboratory to confirm, or refute, the prediction, and, even if correct, to define the function in
greater detail.

We conclude that predictions are useful but no substitute for work in the laboratory. Indica-
tions from theory may indict, but only experimental evidence can convict.